CONCEPT DIABETES: Analytical pipeline
Introduction
CONCEPT-DIABETES
CONCEPT-DIABETES is part of CONCEPT, a coordinated research initiative under the umbrella of REDISSEC, the Spanish Research Network on Health Services Research and Chronic Conditions (www.redissec.com), which aims to analyze the effectiveness and efficiency of chronic care in a number of cohorts built on real-world data (RWD). In the specific case of CONCEPT-DIABETES, the focus is on assessing the effectiveness of a set of clinical practice actions and quality improvement strategies at different levels (patient, healthcare provider, and health system) on the management and health outcomes of patients with type 2 diabetes (T2D), using process mining methodology.
It is a population-based retrospective observational study covering all T2D patients diagnosed in four Regional Health Services within the Spanish National Health Service. It includes information from all their contacts with the health services, drawn from electronic medical record systems: primary care data, specialist care data, hospitalizations, urgent care data, and pharmacy claims, as well as other registers such as the mortality and population registers. We will assess to what extent interventions recommended by evidence-based guidelines are implemented in real life and what their effects on health outcomes are. Process mining methods will be used to analyze the data, and comparisons with standard methods will also be conducted.
Cohort
The cohort is defined as patients with type 2 diabetes:
Inclusion criteria: Patients that, at 2017-01-01 or during the follow-up from 2017-01-01 to 2022-12-31, had an active health card (active TIA, tarjeta sanitaria activa), had one of the codes in the inclusion code list (‘T90’ if CIAP-2, ‘250’ if CIE-9CM, or ‘E11’ if CIE-10-ES) in the clinical records of primary care, and had none of the codes in the exclusion code list (‘T89’ if CIAP-2; ‘250.01’, ‘250.03’, ‘250.11’, ‘250.13’, ‘250.21’, ‘250.23’, ‘250.31’, ‘250.33’, ‘250.41’, ‘250.43’, ‘250.51’, ‘250.53’, ‘250.61’, ‘250.63’, ‘250.71’, ‘250.73’, ‘250.81’, ‘250.83’, ‘250.91’ or ‘250.93’ if CIE-9CM; or ‘E10’ if CIE-10-ES).
Exclusion criteria: Patients that had one of the codes in the exclusion code list (‘T89’ if CIAP-2; ‘250.01’, ‘250.03’, ‘250.11’, ‘250.13’, ‘250.21’, ‘250.23’, ‘250.31’, ‘250.33’, ‘250.41’, ‘250.43’, ‘250.51’, ‘250.53’, ‘250.61’, ‘250.63’, ‘250.71’, ‘250.73’, ‘250.81’, ‘250.83’, ‘250.91’ or ‘250.93’ if CIE-9CM; or ‘E10’ if CIE-10-ES) during their follow-up, or patients with no contact with the health system from 2017-01-01 to 2022-12-31.
Study period: 2017-01-01 until 2022-12-31.
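The inclusion/exclusion logic above can be sketched as a small filter over diagnosis records. This is a minimal illustration only: the column names ('patient_id', 'codif', 'code') and the toy records are assumptions, not the project's actual schema.

```python
import pandas as pd

# Code lists taken from the criteria above.
INCLUSION = {'CIAP-2': {'T90'}, 'CIE-9CM': {'250'}, 'CIE-10-ES': {'E11'}}
EXCLUSION = {'CIAP-2': {'T89'},
             'CIE-9CM': {f'250.{d}{u}' for d in range(10) for u in ('1', '3')},
             'CIE-10-ES': {'E10'}}

def select_cohort(records: pd.DataFrame) -> set:
    """Return patient_ids with at least one inclusion code and no exclusion code."""
    def has(table):
        mask = records.apply(
            lambda r: r['code'] in table.get(r['codif'], set()), axis=1)
        return set(records.loc[mask, 'patient_id'])
    return has(INCLUSION) - has(EXCLUSION)

# Invented toy records for illustration.
records = pd.DataFrame({
    'patient_id': [1, 2, 2, 3],
    'codif':      ['CIE-10-ES', 'CIAP-2', 'CIAP-2', 'CIE-9CM'],
    'code':       ['E11', 'T90', 'T89', '250.01'],
})
print(select_cohort(records))  # {1}: patient 2 is excluded by T89, patient 3 has no inclusion code
```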
Treatment guidelines
One of the main intermediate outcome indicators to which clinical guidelines pay special attention is good glycaemic control, since its absence is clearly related to micro- and macrovascular complications. In clinical practice, suboptimal glycaemic control can be attributed to two main causes: the patients’ non-adherence to the prescribed treatment, and the healthcare providers’ non-adherence to clinical or therapeutic guidelines.
Treatment decisions are based on glycated hemoglobin (HbA1c) measurements. In this context, the redGDPS foundation provides the DM2 treatment algorithm, a diagram that aims to help professionals quickly and easily choose the most appropriate treatment for people with diabetes.
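Later in the pipeline, HbA1c measurements are dichotomized around the 7.5 threshold into the two event states used in the event log. A minimal sketch (the function name is an assumption; the threshold and state labels come from this document):

```python
def hba1c_state(value_pct: float) -> str:
    """Map an HbA1c value (%) to the two event states used in the event log."""
    # Values strictly below 7.5 count as controlled, matching the pipeline code.
    return 'HBA<75' if value_pct < 7.5 else 'HBA>75'

print(hba1c_state(6.8))  # HBA<75
print(hba1c_state(8.1))  # HBA>75
```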
Running code
The Python libraries used are:
Code
import sys
import pm4py, subprocess
import pandas as pd
import numpy as np
import itertools
import matplotlib.pyplot as plt
import textdistance
import gensim.corpora as corpora
from tqdm import trange, tqdm
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import fcluster, linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.cluster import KElbowVisualizer
from pm4py.objects.petri_net.obj import PetriNet, Marking
from pm4py.visualization.decisiontree import visualizer as tree_visualizer
from pm4py.algo.decision_mining import algorithm as decision_mining
from pm4py.visualization.petri_net import visualizer
from gensim.models import LdaModel
from datetime import datetime, timedelta
from dateutil.relativedelta import relativedelta
import duckdb
import re
The R libraries used are:
Code
library(tidyverse)
library(lubridate)
library(jsonlite)
library(ggplot2)
library(bupaR)
library(processmapR)
library(dplyr)
library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
library(here)
library(survival)
library(coxme)
library(muhaz)
library(ggfortify)
library(bayestestR)
library(purrr)
Data preprocessing and event log creation
Process mining requires an event log. The functions below take the data model’s treatment and parameter tables and create an event log of glycated hemoglobin measurements and patient treatments. Glycated hemoglobin measurement events are divided into two states: those with a value below 7.5, and the rest. These measurement events have no duration. Treatment events, on the other hand, do have a duration: they are defined from drug dispensings, extending each dispensing by a fixed time period after the date of the last dispensing of the same drug, provided the dispensings occur regularly. The functions below build the event log under these assumptions. The analysis in this document is carried out by the predominant clinical condition of each patient according to the DM2 treatment algorithm; in other words, patients are grouped by their predominant clinical condition and each group is analyzed independently.
Code
def select_by_condition(cond):
'''
Filtering data by predominant clinical condition according to redGDPS
Args:
cond (str): condition wanted to analyze
Return:
presc_data_p (float): proportion of non-null prescription start dates
https://www.sciencedirect.com/science/article/pii/S1751991821000176?via%3Dihub#tbl0015
https://www.sanidad.gob.es/estadEstudios/estadisticas/estadisticas/estMinisterio/SIAP/map_cie9mc_cie10_ciap2.htm
obesity (BMI≥30 kg/m2), age older than 75, established CVD (defined as
myocardial infarction, ischemic heart disease, cerebrovascular disease,
or peripheral arterial disease), CKD (defined as eGFR < 60 ml/min/
1.73 m2 and/or UACR ≥ 30 mg/g) and HF.
'''
cmbd_df_path = '../../outputs/dm_cmbd_df_%s.csv' % cond
comorb_df_path = '../../outputs/dm_comorb_df_%s.csv' % cond
param_df_path = '../../outputs/dm_param_df_%s.csv' % cond
patients_df_path = '../../outputs/dm_patients_df_%s.csv' % cond
treat_df_path = '../../outputs/dm_treat_df_%s.csv' % cond
con = duckdb.connect('../../inputs/data.duckdb')
cmbd_df = con.sql("SELECT * FROM main.cmbd WHERE patient_id IN (SELECT patient_id FROM main.patient WHERE dx_date >='2017-01-01')").df()
comorb_df = con.sql("SELECT * FROM main.comorb WHERE patient_id IN (SELECT patient_id FROM main.patient WHERE dx_date >='2017-01-01')").df()
param_df = con.sql("SELECT * FROM main.param WHERE patient_id IN (SELECT patient_id FROM main.patient WHERE dx_date >='2017-01-01')").df()
patients_df = con.sql("SELECT * FROM main.patient WHERE patient_id IN (SELECT patient_id FROM main.patient WHERE dx_date >='2017-01-01')").df()
treat_df = con.sql("SELECT * FROM main.treat WHERE patient_id IN (SELECT patient_id FROM main.patient WHERE dx_date >='2017-01-01')").df()
con.close()
ecv_code_list = ['K75','K76','K91','K92']
ic = 'K77'
erc = 'U99.01'
ob = 'T82'
cie9_df = pd.read_csv('./CIE9.csv')
cie10_df = pd.read_csv('./CIE10.csv')
cie9 = dict(zip(cie9_df.CIE9,cie9_df.BDCAP))
cie10 = dict(zip(cie10_df.CIE10,cie10_df.BDCAP))
comorb_df = comorb_df[comorb_df['comorb_date_end'].isna()]
comorb_df['ciap2'] = comorb_df['comorb_code']
comorb_df.index = range(len(comorb_df))
for n in range(len(comorb_df)):
if comorb_df['comorb_codif'][n]=='ICD9':
comorb_df['ciap2'][n] = cie9.get(comorb_df['comorb_code'][n],None)
elif comorb_df['comorb_codif'][n]=='ICD10':
comorb_df['ciap2'][n] = cie10.get(comorb_df['comorb_code'][n],None)
# Which is the predominal clinical condition at baseline?
baseline = datetime.strptime('2017-01-01', "%Y-%m-%d")
patients_df['age'] = [(baseline-
patients_df['month_nac'][n]).days//365
for n in range(len(patients_df))]
ob_list = set(comorb_df['patient_id'][comorb_df['ciap2']==ob])
f_list = set(patients_df['patient_id'][patients_df['age']>75])
erc_list = set(comorb_df['patient_id'][comorb_df['ciap2']==erc])
ic_list = set(comorb_df['patient_id'][comorb_df['ciap2']==ic])
ecv_list = set(comorb_df['patient_id'][n] for n in range(len(comorb_df))
if comorb_df['ciap2'][n] in ecv_code_list)
# Which is the predominal clinical condition at dx_date?
dx_date = dict(zip(patients_df.patient_id,patients_df.dx_date))
cmbd_long = pd.concat(
[cmbd_df[['patient_id','admission_date','diagnosis1_code']
].rename(columns = {'diagnosis1_code':'code'}),
cmbd_df[['patient_id','admission_date','diagnosis2_code']
].rename(columns = {'diagnosis2_code':'code'}).dropna(),
cmbd_df[['patient_id','admission_date','diagnosis3_code']
].rename(columns = {'diagnosis3_code':'code'}).dropna()])
cmbd_long['code'] = cmbd_long['code'].apply(
lambda x: str(x).replace('E66.3','sobrepeso').split('.')[0])
to_ecv_change_list = ['I2%i' %i for i in range(6)]+[
'I6%i' %i for i in range(10)]+[
'I7%i' %i for i in [0,3,4]]
# ENFERMEDADES ISQUÉMICAS CARDIACAS (I20-I25)
# ENFERMEDADES CEREBROVASCULARES (I60-I69)
# ATEROESCLEROSIS (ENFERMEDADES VASCULARES PERIFÉRICAS) (I70)
# OTRAS ENFERMEDADES VASCULARES PERIFÉRICAS (I73)
# EMBOLIA Y TROMBOSIS ARTERIAL (ENFERMEDADES VASCULARES PERIFÉRICAS) (I74)
to_ic_change_list = ['I50','I13']
# INSUFICIENCIA CARDÍACA (I50)
# ENFERMEDAD CARDÍACA Y RENAL CRÓNICA HIPERTENSIVA (I13)
to_erc_change_list = ['N18','I12','filtglom']
# ENFERMEDAD RENAL CRÓNICA (N18)
# ENFERMEDAD RENAL CRÓNICA HIPERTENSIVA (I12)
to_f_change_list = []
to_ob_change_list = ['E66']
# SOBREPESO Y OBESIDAD (E66)
to_else_change_list = []
# https://ukkidney.org/health-professionals/information-resources/uk-eckd-guide/ckd-stages
# Include obesity and kcd detection by parameters values
cmbd_bmi = param_df[param_df['param_name']=='bmi']
cmbd_bmi['param_value'] = cmbd_bmi['param_value'].apply(
lambda x: 'bmi>30' if x>30 else 'bmi<30')
cmbd_bmi['dx_date'] = cmbd_bmi['patient_id'].apply(
lambda x: dx_date[x]).to_list()
dx_previous_bmi = (cmbd_bmi[cmbd_bmi['param_date'] <= cmbd_bmi['dx_date']]
).groupby('patient_id')['param_date'].max().to_dict()
cmbd_bmi = cmbd_bmi[['patient_id','param_date','param_value']
].rename(columns={
'param_value':'code',
'param_date':'admission_date'})
param_filt = param_df[(param_df['param_name']=='filtglom') &
(param_df['param_value']>60)
].sort_values(['patient_id','param_date'])
param_filt['time_diff'] = param_filt.groupby(
'patient_id')['param_date'].diff()
cmbd_filt = param_filt[param_filt['time_diff'] < pd.Timedelta(days=90)
].drop_duplicates(subset='patient_id',keep='first')[
['patient_id','param_date','param_name']
].rename(columns={
'param_name':'code',
'param_date':'admission_date'})
cmbd_long = pd.concat([cmbd_long,cmbd_filt,cmbd_bmi])
cmbd_long.index = range(len(cmbd_long))
cmbd_long['dx_date'] = cmbd_long['patient_id'].apply(
lambda x: dx_date[x]).to_list()
ob_list.update(
set(cmbd_long.query(
"code in @to_ob_change_list & admission_date<dx_date")['patient_id']))
ob_list.update(
set([cmbd_long['patient_id'][n] for n in range(len(cmbd_long)) if
cmbd_long['code'][n]=='bmi>30' and
dx_previous_bmi.get(cmbd_long['patient_id'][n],
'')==cmbd_long['admission_date'][n]]))
erc_list.update(
set(cmbd_long.query(
"code in @to_erc_change_list & admission_date<dx_date")['patient_id']))
ic_list.update(
set(cmbd_long.query(
"code in @to_ic_change_list & admission_date<dx_date")['patient_id']))
ecv_list.update(
set(cmbd_long.query(
"code in @to_ecv_change_list & admission_date<dx_date")['patient_id']))
ecv_list = list(ecv_list)
ic_list = list(ic_list.difference(ecv_list))
erc_list = list(erc_list.difference(ecv_list+ic_list))
f_list = list(f_list.difference(ecv_list+ic_list+erc_list))
ob_list = list(ob_list.difference(ecv_list+ic_list+erc_list+f_list))
else_list = list(set(patients_df['patient_id']
).difference(ecv_list+ic_list+erc_list+f_list+ob_list))
list_cond = eval('%s_list' % cond)
patients_df = patients_df[patients_df['patient_id'].isin(list_cond)]
patients_df.index = range(len(patients_df))
param_df = param_df[param_df['patient_id'].isin(list_cond)]
treat_df = treat_df[treat_df['patient_id'].isin(list_cond)]
comorb_df = comorb_df[comorb_df['patient_id'].isin(list_cond)]
comorb_df.index = range(len(comorb_df))
cmbd_df = cmbd_df[cmbd_df['patient_id'].isin(list_cond)]
cmbd_df.index = range(len(cmbd_df))
# When is a change on predominant clinical condition made?
from_ecv_change_list = []
from_ic_change_list = to_ecv_change_list
from_erc_change_list = from_ic_change_list+to_ic_change_list
from_f_change_list = from_erc_change_list+to_erc_change_list
from_ob_change_list = from_f_change_list+['bmi<30']
from_else_change_list = from_ob_change_list+to_ob_change_list
fin_date = dict()
fin_cause = dict()
death_date = dict(zip(patients_df.patient_id,patients_df.death_date))
for id in list_cond:
cmbd_id = cmbd_long[cmbd_long['patient_id']==id]
try:
date = min(cmbd_id.query(
'code in @from_%s_change_list & admission_date>=dx_date' % cond
)['admission_date'])
cause = list(cmbd_id[cmbd_id['admission_date']==date]['code'])
if str(death_date[id])!='NaT':
if death_date[id]<=date:
date = death_date[id]
cause = ['death']
if cond in ['ob','else']:
turn_75_date = list(patients_df['month_nac'][
patients_df['patient_id']==id])[0]+relativedelta(years=75)
if turn_75_date<date:
date = turn_75_date
cause = ['turn_75']
fin_date[id] = date
fin_cause[id] = str(cause)
treat_df.drop(treat_df[((treat_df['dispensing_date']>date) | (
treat_df['dispensing_date']<dx_date[id])) & (
treat_df['patient_id']==id)].index, inplace=True)
param_df.drop(param_df[((param_df['param_date']>date) | (
param_df['param_date']<dx_date[id])) & (
param_df['patient_id']==id)].index, inplace=True)
except:
if str(death_date[id])!='NaT':
fin_date[id] = death_date[id]
fin_cause[id] = str(['death'])
treat_df.drop(treat_df[(treat_df['dispensing_date']<dx_date[id]) & (
treat_df['patient_id']==id)].index, inplace=True)
param_df.drop(param_df[(param_df['param_date']<dx_date[id]) & (
param_df['patient_id']==id)].index, inplace=True)
patients_df['fin_date'] = patients_df['patient_id'].apply(
lambda x: fin_date.get(x,None)).to_list()
patients_df['fin_cause'] = patients_df['patient_id'].apply(
lambda x: fin_cause.get(x,None)).to_list()
treat_df.index = range(len(treat_df))
param_df.index = range(len(param_df))
patients_df.to_csv(patients_df_path,index=False)
param_df.to_csv(param_df_path,index=False)
treat_df.to_csv(treat_df_path,index=False)
comorb_df.to_csv(comorb_df_path,index=False)
cmbd_df.to_csv(cmbd_df_path,index=False)
presc_data_p = treat_df['prescription_date_ini'].notna().sum()/len(treat_df)
return presc_data_p
def change_matrix_values(matrix,list_rows,list_cols, new_value=0):
'''Set selected rows' and columns' values of a matrix
Args:
matrix (array): matrix wanted to modify
list_rows (list): matrix's rows wanted to modify
list_cols (list): matrix's columns wanted to modify
new_value (int): new wanted value
Output:
matrix (array): modified matrix
'''
pos = list(itertools.product(list_rows,list_cols))
for i,j in pos:
matrix[i,j] = new_value
return matrix
def col2treat(matrix,rows,col):
'''translate treatments from a binary vector
Args:
matrix (array): events calendary in binary information
rows (str): events list
col (int): column of matrix wanted to translate
Output:
treatment (str): '+'-joined names of the active events, or '_' if none
'''
treatment = '+'.join(sorted([rows[ev
] for ev in range(len(rows))
if matrix[ev,col]!=0]))
if treatment!='':
return treatment
else:
return '_'
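A toy run of the two helpers above: build a day-by-event binary matrix, mark some treatment spans, and read a column back as a treatment label. The helper definitions are repeated here so the snippet runs standalone; the event names are invented examples.

```python
import itertools
import numpy as np

def change_matrix_values(matrix, list_rows, list_cols, new_value=0):
    # Same behavior as the helper above: overwrite the selected cells.
    for i, j in itertools.product(list_rows, list_cols):
        matrix[i, j] = new_value
    return matrix

def col2treat(matrix, rows, col):
    # Same behavior as the helper above: join the active events in a column.
    treatment = '+'.join(sorted(rows[ev] for ev in range(len(rows))
                                if matrix[ev, col] != 0))
    return treatment if treatment != '' else '_'

events = ['MET', 'SU', 'HBA<75']          # rows: events; columns: days
status = np.zeros((3, 5), dtype=int)
status = change_matrix_values(status, [0], range(0, 5), 1)  # MET on days 0-4
status = change_matrix_values(status, [1], range(2, 5), 1)  # SU added on day 2
print(col2treat(status, events, 1))  # 'MET'
print(col2treat(status, events, 3))  # 'MET+SU'
```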
def nac2comprimidos(nac_path,code_set):
'''Create a dictionary mapping NAC drug codes to tablets per container
Args:
nac_path (str): path to the NAC (Nomenclator) table
code_set (set): set of NAC codes to look up
Outputs:
nac_dict (dict): NAC codes as keys, number of tablets per container as values
'''
nac = pd.read_csv(nac_path)
nac['Código Nacional'] = nac['Código Nacional'].astype(str)
nac_dict = {}
for n in range(len(nac)):
if nac['Código Nacional'][n] in code_set:
text = nac['Nombre del producto farmacéutico'][n].lower()
if re.search(r'\d+(?=\s*comprim)', text)!=None:
nac_dict[nac['Código Nacional'][n]
] = int(re.search(r'\d+(?=\s*comprim)', text).group())
elif re.search(r'\d+(?=\s*comprim)',
re.sub(r'\([^()]*\)', '', text))!=None:
nac_dict[nac['Código Nacional'][n]
] = int(re.search(r'\d+(?=\s*comprim)',
re.sub(r'\([^()]*\)', '', text)).group())
elif re.search(r'con pelicula (\d+)', text)!=None:
nac_dict[nac['Código Nacional'][n]
] = int(re.search(r'con pelicula (\d+)',
text).group().split()[-1])
return nac_dict
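The regex logic in nac2comprimidos can be checked in isolation: the first pattern grabs a number directly preceding "comprim…", and the fallback strips parenthesized text first. The product names below are invented examples, not real Nomenclator entries.

```python
import re

def tablets(text):
    """Extract the tablet count from a product name, mirroring the first two
    branches of nac2comprimidos above (simplified sketch)."""
    text = text.lower()
    m = re.search(r'\d+(?=\s*comprim)', text)
    if m is None:
        # Fallback: remove parenthesized fragments, then retry.
        m = re.search(r'\d+(?=\s*comprim)', re.sub(r'\([^()]*\)', '', text))
    return int(m.group()) if m else None

print(tablets('metformina 850 mg 50 comprimidos'))   # 50 (the dose '850' is skipped)
print(tablets('simvastatina 28 (efg) comprimidos'))  # 28 (needs the fallback branch)
```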
def evlog_creation_prescriptions(treat_df_path='../../outputs/dm_treat_df.csv',
param_df_path='../../outputs/dm_param_df.csv',
patients_df_path='../../outputs/dm_patients_df.csv',
code2drug_path='./diabetes_drugs.csv',
output_file='../../outputs/eventlog_raw.csv'):
'''Preprocessing and event log creation from prescriptions
Args:
treat_df_path (str): datamodel's treatment table's path
param_df_path (str): datamodel's parameter table's path
patients_df_path (str): datamodel's patients table's path
code2drug_path (str): path to the table mapping drug codes to drug classes
ADNI: ANTIDIABETICOS NO INSULINICOS
The treatment of type 2 diabetes mellitus with ADNI includes a wide range of
drugs which, according to their mechanisms of action, can be grouped as
follows:
Increased endogenous insulin sensitivity:
o Biguanides: metformin (MET).
o Thiazolidinediones: pioglitazone (PIO).
Increased endogenous insulin secretion/release:
o Sulfonylureas (SU).
o Meglitinides: repaglinide (REP)
Reduction of digestive glucose absorption:
o Alpha-glucosidase inhibitors.
o Vegetable fiber and derivatives.
Incretin effect enhancers.
o Inhibitors of DPP-4 (iDPP-4).
o GLP-1 analogues (aGLP-1).
Inhibitors of renal glucose reuptake
o SGLT-2 inhibitors (iSGLP-2)
'''
treat_df = pd.read_csv(treat_df_path)
param_df = pd.read_csv(param_df_path)
patients_df = pd.read_csv(patients_df_path)
code2drug = pd.read_csv(code2drug_path,index_col=0).to_dict()
dx_date = dict(zip(patients_df.patient_id,patients_df.dx_date))
fin_date = dict(zip(patients_df.patient_id,patients_df.fin_date))
# definir los tipos de medicamentos a analizar
col_atc = 'atc_code'
col_id = 'patient_id'
col_date_start = 'prescription_date_ini'
col_date_end = 'prescription_date_end'
col_name = 'Event'
treat_df.index = range(len(treat_df))
treat_df[col_name] = [code2drug.get(treat_df[col_atc][c],{'class' : 'ERROR'}
).get('class','ERROR2').replace('+','&'
) if treat_df[col_atc][c].startswith('A'
) else None for c in range(
len(treat_df[col_atc])) ]
treat_df = treat_df.drop(labels=[i for i in range(len(treat_df)
) if treat_df[col_name][i] in [None,'ERROR','ERROR2']],
axis=0).drop_duplicates(subset=[col_id,
col_date_start,
col_date_end,
col_name])
treat_df['actins'] = range(len(treat_df))
treat_df_start = treat_df[[col_id,col_date_start,col_name,'actins']
].rename(columns = {col_date_start:'date'})
treat_df_start['date'] = pd.to_datetime(treat_df_start['date'],
format='%Y-%m-%d')
treat_df_end = treat_df[[col_id,col_date_end,col_name,'actins']
].rename(columns = {col_date_end:'date'})
treat_df_start['cycle'] = 'start'
treat_df_end['cycle'] = 'end'
treat_df_end['date'] = treat_df_end['date'].fillna('2022-12-31').apply(
lambda x: min([x,'2023-01-01']))
treat_df_end['date'] = pd.to_datetime(treat_df_end['date'],
format='%Y-%m-%d').apply(
lambda x: x-timedelta(days=1))
col_param = 'param_name'
col_value = 'param_value'
col_date = 'param_date'
param_df = param_df.drop(labels=[i for i in range(len(param_df)
) if param_df[col_param][i]!='hba1c'],
axis=0)
param_df.index = range(len(param_df))
param_df[col_value] = ['HBA<75' if i<7.5 else 'HBA>75' for i in list(
param_df[col_value])]
param_df.rename(columns = {col_value: col_name,col_date:'date'},
inplace = True)
param_df['actins'] = param_df.index+[max(treat_df['actins'])+1]*len(param_df)
param_df = param_df[[col_id,'date',col_name,'actins']]
param_df['date'] = pd.to_datetime(param_df['date'],
format='%Y-%m-%d')
param_df_start = param_df.copy()
param_df_start['cycle'] = 'start'
param_df_end = param_df.copy()
param_df_end['cycle'] = 'end'
df = pd.concat([param_df_start,treat_df_start,param_df_end,treat_df_end]
).drop_duplicates()
df = df.sort_values(by = [col_id,'date',col_name,'cycle'],
ascending = [True,True,True,False])
df.index = range(len(df))
patient_list = list(set(df[col_id]))
df['nid'] = [patient_list.index(df[col_id][n]) for n in range(len(df))]
df = df.sort_values(by = [col_id, 'date','cycle'],
ascending = [True, True, False])
df.index = range(len(df))
#####################################################################
event_log = dict()
event_log['patient_id'] = []
event_log['date'] = []
event_log['nid'] = []
event_log['Event'] = []
event_log['cycle'] = []
days_after_m = 5
id_list = list(set(df['patient_id']))
events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
hba0, hba1 = events.index('HBA<75'), events.index('HBA>75')
row = len(events)
no_hba = [i for i in range(row) if i not in [hba0,hba1]]
for id in tqdm(id_list):
df_id = df[df['patient_id']==id]
nid = id_list.index(id)
actins = set(df_id['actins'])
date_min = min([min(set(df_id['date'])),
datetime.strptime(dx_date[id], "%Y-%m-%d")])
date_max = max(set(df_id['date']))
if str(fin_date[id])!='nan':
date_max = max([date_max,datetime.strptime(fin_date[id], "%Y-%m-%d")])
else:
date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
dd = [date_min + timedelta(days=x) for x in range(
(date_max-date_min).days + 1)]
col = len(dd)
ev_status = np.zeros((row,col))
for act in actins:
ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='start')])[0]
ev = list(df_id['Event'][df_id['actins']==act])[0]
fin = list(df_id['date'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='end')])[0]
ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
ev_status = change_matrix_values(ev_status,ev_cols,
list(range(dd.index(ini),
dd.index(fin)+1)),1)
measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
ev_status[hba1,:].nonzero()[0])))
ev_status = ev_status.astype(int)
#delete treatments in measure events
ev_status = change_matrix_values(ev_status,no_hba,measures,0)
#correction to eliminate events after hemoglobin measurement that are
#the continuation of the previous treatment and not the new treatment.
#Example: If a patient has 'A' treatment and because of hemoglobin's
# measure they change their treatment to 'B', originally their
# trace could appear as A>measure>A>B, but the true trace
# should be A>measure>B. Therefore, in a fixed period time
# after each measurement 'days_after_m', if we detect that
# mistake we delete it.
for j in measures:
if ev_status.shape[1]<=j+1 or days_after_m<3:
continue
first_col = ev_status[:,j+1]
dif_col = np.where(np.any(ev_status[:, j+2:j+days_after_m]!=
first_col[:, np.newaxis],
axis=0))[0]
if (dif_col.size > 0) and (not
np.array_equal(ev_status[:,j-1],
ev_status[:,j+dif_col[0]+2])) and (
np.array_equal(ev_status[:,j-1],
ev_status[:,j+1])):
ev_status = change_matrix_values(ev_status,no_hba,
range(j+1,j+dif_col[0]+2),0)
col = dd.index(datetime.strptime(dx_date[id], "%Y-%m-%d"))
col_0 = dd.index(datetime.strptime(dx_date[id], "%Y-%m-%d"))
col_max = dd.index(datetime.strptime(fin_date[id], "%Y-%m-%d")
) if str(fin_date[id])!='nan' else len(dd)
min_change_days = 7
while col<col_max-1:
if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
ev = col2treat(ev_status,events,col_0)
if ('HBA' not in ev and col-col_0<min_change_days):
col+=1
col_0 = col
continue
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
col+=1
col_0 = col
else:
col+=1
ev = col2treat(ev_status,events,col_0)
if 'HBA' in ev or col-col_0>=min_change_days:
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
evlog = pd.DataFrame.from_dict(event_log)
evlog = evlog.sort_values(['patient_id','date'])
evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
evlog.index = range(len(evlog))
evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
evlog['actins'] = [n//2 for n in range(len(evlog))]
evlog.to_csv(output_file)
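The core of the while loop above is an interval scan: walk the daily status matrix column by column and emit one (event, start_day, end_day) row whenever the set of active events changes. A compact standalone sketch on toy data (the real function additionally clips to the diagnosis/end dates and applies the minimum-duration rule):

```python
import numpy as np

def matrix_to_intervals(status, events):
    """Collapse a (events x days) 0/1 matrix into change-point intervals."""
    def label(col):
        names = '+'.join(sorted(events[i] for i in range(len(events))
                                if status[i, col]))
        return names or '_'
    out, col_0 = [], 0
    for col in range(1, status.shape[1]):
        if not np.array_equal(status[:, col], status[:, col - 1]):
            out.append((label(col - 1), col_0, col - 1))  # close interval
            col_0 = col
    out.append((label(col_0), col_0, status.shape[1] - 1))  # final interval
    return out

events = ['MET', 'SU']
status = np.array([[1, 1, 1, 1, 1],    # MET active every day
                   [0, 0, 1, 1, 0]])   # SU added on days 2-3
print(matrix_to_intervals(status, events))
# [('MET', 0, 1), ('MET+SU', 2, 3), ('MET', 4, 4)]
```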
def evlog_creation_dispensations(treat_df_path='../../outputs/dm_treat_df.csv',
param_df_path='../../outputs/dm_param_df.csv',
patients_df_path='../../outputs/dm_patients_df.csv',
code2drug_path='./diabetes_drugs.csv',
nac_path='./Nomenclator_de_Facturacion.csv',
output_file='../../outputs/eventlog_raw.csv'):
'''Preprocessing and event log creation from dispensations
Args:
treat_df_path (str): datamodel's treatment table's path
param_df_path (str): datamodel's parameter table's path
patients_df_path (str): datamodel's patients table's path
code2drug_path (str): path to the table mapping drug codes to drug classes
ADNI: ANTIDIABETICOS NO INSULINICOS
The treatment of type 2 diabetes mellitus with ADNI includes a wide range of
drugs which, according to their mechanisms of action, can be grouped as
follows:
Increased endogenous insulin sensitivity:
o Biguanides: metformin (MET).
o Thiazolidinediones: pioglitazone (PIO).
Increased endogenous insulin secretion/release:
o Sulfonylureas (SU).
o Meglitinides: repaglinide (REP)
Reduction of digestive glucose absorption:
o Alpha-glucosidase inhibitors.
o Vegetable fiber and derivatives.
Incretin effect enhancers.
o Inhibitors of DPP-4 (iDPP-4).
o GLP-1 analogues (aGLP-1).
Inhibitors of renal glucose reuptake
o SGLT-2 inhibitors (iSGLP-2)
'''
treat_df = pd.read_csv(treat_df_path)
param_df = pd.read_csv(param_df_path)
patients_df = pd.read_csv(patients_df_path)
code2drug = pd.read_csv(code2drug_path,index_col=0).to_dict()
dx_date = dict(zip(patients_df.patient_id,patients_df.dx_date))
fin_date = dict(zip(patients_df.patient_id,patients_df.fin_date))
treat_min_days = 2
days_error = 60-treat_min_days
days_after_m = 7
# definir los tipos de medicamentos a analizar
col_atc = 'atc_code'
col_nac = 'nac_code'
col_id = 'patient_id'
col_date = 'dispensing_date'
col_name = 'Event'
treat_df.index = range(len(treat_df))
treat_df[col_name] = [code2drug.get(treat_df[col_atc][c],{'class' : 'ERROR'}
).get('class','ERROR2').replace('+','&'
) if treat_df[col_atc][c].startswith('A'
) else None for c in range(
len(treat_df[col_atc])) ]
treat_df = treat_df.drop(labels=[i for i in range(len(treat_df)
) if treat_df[col_name][i] in [None,'ERROR','ERROR2']],
axis=0)
treat_df.index = range(len(treat_df))
treat_df[col_nac] = treat_df[col_nac].astype(str)
nac_dict = nac2comprimidos(nac_path,set(treat_df[col_nac]))
treat_df[col_nac] = treat_df[col_nac].apply(
lambda x: nac_dict.get(x,days_error))
treat_df['containers_number'] = treat_df['containers_number'].fillna(1)
treat_df[col_nac] = treat_df[col_nac]*treat_df['containers_number']
treat_df = treat_df.sort_values(by = [col_id,col_date ],
ascending = [True, True])
treat_df.rename(columns = {col_date:'date'}, inplace = True)
treat_df = treat_df[[col_id,'date',col_name,col_nac]]
col_param = 'param_name'
col_value = 'param_value'
col_date = 'param_date'
param_df = param_df.drop(labels=[i for i in range(len(param_df)
) if param_df[col_param][i]!='hba1c'],
axis=0)
param_df[col_value] = ['HBA<75' if i<7.5 else 'HBA>75' for i in list(
param_df[col_value])]
param_df.rename(columns = {col_value: col_name,col_date:'date'},
inplace = True)
param_df = param_df[[col_id,'date',col_name]]
df = pd.concat([param_df,treat_df])
df = df.sort_values(by = [col_id, 'date'], ascending = [True, True])
df.index = range(len(df))
df['cycle'] = 'start'
df['date'] = pd.to_datetime(df['date'], format='%Y-%m-%d')
df['actins'] = list(df.index)
patient_list = list(set(df[col_id]))
df['nid'] = [patient_list.index(df[col_id][n]) for n in range(len(df))]
df_ = df.copy()
df_['cycle'] = 'end'
for n in range(len(df_)):
if 'HBA' in df_[col_name][n]:
continue
df_['date'][n] += timedelta(days=treat_min_days)
df = pd.concat([df,df_])
df = df.sort_values(by = [col_id, 'actins','cycle'],
ascending = [True, True, False])
df.index = range(len(df))
df[col_nac].fillna(0,inplace=True)
#####################################################################
event_log = dict()
event_log['patient_id'] = []
event_log['date'] = []
event_log['nid'] = []
event_log['Event'] = []
event_log['cycle'] = []
id_list = list(set(df['patient_id']))
events = list(set([drug for e in set(df['Event']) for drug in e.split('&')]))
hba0, hba1 = events.index('HBA<75'), events.index('HBA>75')
row = len(events)
no_hba = [i for i in range(row) if i not in [hba0,hba1]]
for id in tqdm(id_list):
df_id = df[df['patient_id']==id]
nid = id_list.index(id)
actins = set(df_id['actins'])
date_min = min([min(set(df_id['date'])),
datetime.strptime(dx_date[id], "%Y-%m-%d")])
date_max = max(set(df_id['date']))
if str(fin_date[id])!='nan':
date_max = max([date_max,datetime.strptime(fin_date[id], "%Y-%m-%d")])
else:
date_max = max([date_max,datetime.strptime('2023-01-01', "%Y-%m-%d")])
dd = [date_min + timedelta(days=x) for x in range(
(date_max-date_min).days + 1)]
col = len(dd)
ev_status = np.zeros((row,col))
max_nac = int(max(df_id[col_nac])*1.2)
#event matrix where columns are days and rows treatments
#0:no treatment, 1:treatment, 2:posible treatment,
#3:after measuring period
for act in actins:
ini = list(df_id['date'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='start')])[0]
ev = list(df_id['Event'][df_id['actins']==act])[0]
fin = ini+timedelta(days=treat_min_days) if 'HBA' not in ev else ini
ev_cols = [events.index(ev_) for ev_ in ev.split('&')]
ev_status = change_matrix_values(ev_status,ev_cols,
list(range(dd.index(ini),
dd.index(fin)+1)),1)
if 'HBA' not in ev:
possible_day_duration = int(list(
df_id['nac_code'][np.logical_and(df_id['actins']==act,
df_id['cycle']=='start')])[0]*1.2)
until_day = min([dd.index(ini)+possible_day_duration,col])
ev_status = change_matrix_values(ev_status,ev_cols,
list(range(dd.index(fin),
until_day)),2)
measures = list(np.concatenate((ev_status[hba0,:].nonzero()[0],
ev_status[hba1,:].nonzero()[0])))
#event compaction.
#Example: If a patient has 'A' treatment and they are dispensing 'A'
# drug multiple times, their trace could appear as A>A>A>A,
# but the true trace should appear as 'A' treatment lasting
# its corresponding time. Therefore, if the difference between
# the last day and the first day of two consecutive same drugs's
# dispensation is less than the number of tablets of the
# container two events are jointed in one with the first day of
# the first event as initial date and last day of the last
# event as the end date.
ev_status = ev_status.astype(int)
for i in no_hba:
seq = str(list(ev_status[i,:]))
for delta_days in range(1,max_nac):
pattern = ', '+str([1]+[2]*delta_days+[1])[1:-1]
new_pattern = ', '+str([1]+[1]*delta_days+[1])[1:-1]
seq = seq.replace(pattern,new_pattern)
ev_status[i,:] = eval(seq)
ev_status = change_matrix_values(ev_status,no_hba,measures,3)
for i,j in list(itertools.combinations(no_hba,2)):
seq_i = str(list(ev_status[i,:]))
seq_j = str(list(ev_status[j,:]))
for delta_days in range(1,max_nac):
pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
if re.search(pattern_i,seq_i)==None:
continue
pattern_j = ', '+str([1]+[2]*delta_days+[3])[1:-1]
new_pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
seq_j = seq_j.replace(pattern_j,new_pattern_j)
for delta_days in range(1,max_nac):
pattern_j = ', '+str([1]+[1]*delta_days+[3])[1:-1]
if re.search(pattern_j,seq_j)==None:
continue
pattern_i = ', '+str([1]+[2]*delta_days+[3])[1:-1]
new_pattern_i = ', '+str([1]+[1]*delta_days+[3])[1:-1]
seq_i = seq_i.replace(pattern_i,new_pattern_i)
ev_status[i,:] = eval(seq_i)
ev_status[j,:] = eval(seq_j)
measures_w_dp = [[i+d_p for d_p in range(days_after_m+1)
if i+d_p<col] for i in measures]
measures_w_dp = list(itertools.chain(*measures_w_dp))
ev_status = change_matrix_values(ev_status,no_hba,measures_w_dp,3)
for i in no_hba:
seq = str(list(ev_status[i,:]))
seq = seq.replace('2','0').replace('3','0')
ev_status[i,:] = eval(seq)
col = dd.index(datetime.strptime(dx_date[id], "%Y-%m-%d"))
col_0 = dd.index(datetime.strptime(dx_date[id], "%Y-%m-%d"))
col_max = dd.index(datetime.strptime(fin_date[id], "%Y-%m-%d")
) if str(fin_date[id])!='nan' else len(dd)
min_change_days = 42
while col<col_max-1:
if not np.array_equal(ev_status[:,col],ev_status[:,col+1]):
ev = col2treat(ev_status,events,col_0)
if (ev=='_' and col-col_0<min_change_days):
col+=1
col_0 = col
continue
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
col+=1
col_0 = col
else:
col+=1
ev = col2treat(ev_status,events,col_0)
if ev!='_' or col-col_0>=min_change_days:
event_log['patient_id'].extend([id,id])
event_log['date'].extend([dd[col_0],dd[col]])
event_log['Event'].extend([ev,ev])
event_log['nid'].extend([nid,nid])
event_log['cycle'].extend(['start','end'])
evlog = pd.DataFrame.from_dict(event_log)
evlog = evlog.sort_values(['patient_id','date'])
evlog.rename(columns = {'patient_id':'ID'}, inplace = True)
evlog.index = range(len(evlog))
evlog['cycle'] = ['start','end']*int(0.5*len(evlog))
evlog['actins'] = [n//2 for n in range(len(evlog))]
evlog.to_csv(output_file)
Distances
One of the most important aims of process mining is to show and explain processes. However, the great variety of traces often prevents drawing any clear conclusions, and it is usually necessary to simplify the data. Another option, which avoids excessive loss of information and gives another perspective to the analysis, is to cluster the traces before simplifying and analyze them cluster by cluster. To do that, we have to somehow measure the differences between traces, i.e. the distance between them. This is not trivial considering that traces are categorical sequences of different lengths. Nevertheless, several distances can be used for this task: edit distances, term vector similarity, LDA-based distances, dynamic time warping, embedding-based distances, and so on. Some of them are implemented below as functions that calculate the distance matrix of traces:
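For instance, a plain Levenshtein (edit) distance can be plugged in as the measure_f argument of the edit-distance function below. This normalized variant, which returns 1 for identical traces, is only an illustrative sketch and not part of the pipeline:

```python
def levenshtein_similarity(a, b):
    """Normalized Levenshtein similarity between two sequences (1 = identical)."""
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    # Normalize so that identical traces score 1 and disjoint ones approach 0.
    return 1 - prev[n] / max(m, n, 1)
```

Because identical traces score 1 with this measure, calculate_dm_ED would convert the resulting similarities into distances through its final dm = 1 - dm step.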
Code
####### EDIT DISTANCE #######
def calculate_dm_ED(traces,measure_f):
'''Calculate distance matrix with some edit distance.
Args:
traces (list): patients' traces
measure_f: some edit distance function
Returns:
dm: distance matrix
'''
id2word = corpora.Dictionary(traces)
traces_e = [[id2word.token2id[t[n]] for n in range(len(t))] for t in traces]
traces_e_str = list(set([str(traces_e[i]
) for i in range(len(traces_e))]))
len_t_r = len(traces_e_str)
len_t = len(traces_e)
dm = np.zeros((len_t,len_t), dtype = np.float32)
same = measure_f(traces_e[0],traces_e[0])
d_dic = {str(t):dict() for t in traces_e_str}
for i in range(len_t_r):
d_dic[traces_e_str[i]][traces_e_str[i]] = same
for j in range(i+1,len_t_r):
d_ij = measure_f(eval(traces_e_str[i]),
eval(traces_e_str[j]))
d_dic[traces_e_str[i]][traces_e_str[j]] = d_ij
d_dic[traces_e_str[j]][traces_e_str[i]] = d_ij
for i in range(len_t):
dm[i][i] = same
for j in range(i+1,len_t):
t_i = str(traces_e[i])
t_j = str(traces_e[j])
d_ij = d_dic[t_i][t_j]
dm[i][j] = d_ij
dm[j][i] = d_ij
if same == 1:
dm = 1 - dm
return dm
####### TERM VECTOR SIMILARITY #######
def calculate_dm_TV(traces):
'''Calculate distance matrix with term vector similarity.
Args:
traces (list): list of traces
Returns:
dm (array): distance matrix
vectorizer: TfidfVectorizer
X: traces vectorized with TfidVectorizer
'''
corpus = [' '.join(t) for t in traces]
vectorizer = TfidfVectorizer(tokenizer=str.split)
X = vectorizer.fit_transform(corpus)
print('calculating dm ...')
dm = np.asarray(np.matmul(X.todense(),X.todense().T))
dm = 1 - dm.round(decimals=4)
return dm, vectorizer, X
####### LDA BASED DISTANCE #######
def calculate_dm_LDA(traces,T=10):
'''Calculate distance matrix with LDA model.
Args:
traces (list): list of traces
T (int): number of topics of LDA model
Returns:
dm (array): distance matrix
lda_model: LdaModel
id2word (dict): tokenized events as keys and events by values
'''
# Create Dictionary
id2word = corpora.Dictionary(traces)
# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in traces]
# Make LDA model
lda_model = LdaModel(corpus=corpus,
id2word=id2word,
num_topics=T,
alpha = 1,
eta = 'auto',
random_state = 123)
get_c_topic = np.array(
lda_model.get_document_topics(corpus,minimum_probability = -0.1))
sigma = np.asarray([[get_c_topic[i][j][1]
for j in range(T)] for i in range(len(corpus))])
sigma2 = np.asarray(np.matmul(sigma,sigma.T))
len_t = len(traces)
dm = np.zeros((len_t,len_t), dtype = np.float32)
same = sigma2[0][0]/np.sqrt(sigma2[0][0]*sigma2[0][0])
for i in trange(len_t):
dm[i][i] = same
for j in range(i+1,len_t):
d_ij = sigma2[i][j]/np.sqrt(sigma2[i][i]*sigma2[j][j])
dm[i][j] = d_ij
dm[j][i] = d_ij
dm = 1-dm
return dm, lda_model, id2word
Clustering
Once the distance matrix is obtained, we can proceed with the clustering. Each clustering method has its own characteristics and peculiarities, so we cannot choose just any method. For example, we have to consider that our distances do not satisfy the triangle inequality, so they cannot be considered metrics. Another consideration is that we have a distance matrix rather than a feature data frame on which to apply a method directly. A hierarchical clustering algorithm is a good choice in our case because, in addition to handling the above, it allows choosing the optimal number of clusters.
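A minimal sketch of this approach on a toy precomputed distance matrix (values invented for illustration), using SciPy's average-linkage hierarchical clustering and cutting the tree into a fixed number of clusters:

```python
import numpy as np
from scipy.spatial.distance import squareform
from scipy.cluster.hierarchy import linkage, fcluster

# Toy symmetric distance matrix for 4 traces (zero diagonal):
# traces 0-1 are close to each other, as are traces 2-3.
dm = np.array([[0.0, 0.1, 0.9, 0.8],
               [0.1, 0.0, 0.85, 0.9],
               [0.9, 0.85, 0.0, 0.2],
               [0.8, 0.9, 0.2, 0.0]])

# linkage expects the condensed (upper-triangular) form of the matrix.
Z = linkage(squareform(dm), method='average')

# Cut the dendrogram into 2 flat clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # traces 0-1 end up in one cluster, traces 2-3 in the other
```

Note that average linkage only requires pairwise dissimilarities, which is why it tolerates distances that are not proper metrics.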
The next code box contains two different functions to choose the optimal number of clusters:
Code
def dendogram(dm,output_png='../../outputs/dendogram.png'):
'''Plot and save a dendrogram.
Args:
dm (array): distance matrix
output_png (str): path where the dendrogram is saved
'''
dm_condensed = squareform(dm)
matrix = linkage(
dm_condensed,
method='average'
)
sys.setrecursionlimit(10000000)
dn = dendrogram(matrix,truncate_mode='lastp',p=80)
sys.setrecursionlimit(1000)
plt.title('Dendrogram')
plt.ylabel('Distance')
plt.xlabel('Patients traces')
plt.savefig(output_png)
plt.clf()
def kelbow(dm,elbow_metric='distortion',locate_elbow=False,
output_path='../../outputs/'):
'''Plots to choose optimal clusters.
Args:
dm (array): distance matrix
elbow_metric (str): name of the method
locate_elbow (boolean): True if want to return optimal number of clusters
Returns:
k_opt (int)(optional): optimal number of clusters according to method
'''
model = AgglomerativeClustering(metric = "precomputed",
linkage = 'average')
# k is range of number of clusters.
visualizer_ = KElbowVisualizer(model,
k=(2,25),
timings=False,
xlabel = 'cluster numbers',
metric=elbow_metric,
locate_elbow=locate_elbow)
# Fit data to visualizer
output_file = output_path+elbow_metric+'.png'
visualizer_.fit(dm)
# Finalize and render figure
#visualizer_.show(output_path=output_file,clear_figure=False)
visualizer_.show(output_path=output_file)
k_opt=None
if locate_elbow:
k_opt = visualizer_.elbow_value_
return k_opt
Depending on the results of the clustering, we may need to focus the analysis on the largest clusters and set aside those containing too few traces. The next function does exactly that:
Code
def small_cluster_filter(log,min_traces=30,col_id='ID',col_clust='cluster'):
'''Filter smallest clusters' traces.
Args:
log (dataframe): event log
min_traces (int): minimum number of traces in the resulted clusters
col_id (str): patients id column's name in df_
col_clust (str): clusters column's name in df_
Returns:
filtered_log (dataframe): event log without clusters with fewer than
min_traces traces
'''
drop_clust = [i for i in set(log[col_clust]) if
len(set(log[col_id][log[col_clust]==i]))<min_traces]
filtered_log = log.drop(log[log[col_clust].isin(drop_clust) == True].index)
filtered_log.index = range(len(filtered_log))
return filtered_log
The clustering function is shown in the code below:
Code
def clust(clust_n,dist_matrix,df_,id2trace,patients,id_col='ID',
ev_col='Event',date_col='date',output_xes='../../outputs/eventlog.xes',
output_csv='../../outputs/eventlog.csv'):
'''Cluster the distance matrix.
Args:
clust_n (int): number of clusters obtained
dist_matrix (array): distance matrix
df_ (dataframe): event log
id2trace (dict): patient ids as keys and their traces as values
patients (list): patients' ids in same order as in dm
id_col (str): patients id column's name in df_
ev_col (str): events column's name in df_
date_col (str): events dates column's name in df_
'''
model = AgglomerativeClustering(n_clusters=clust_n,
metric = "precomputed",
linkage = 'average')
model.fit(dist_matrix)
labels = model.labels_
cluster_list ={id: labels[traces.index(id2trace[id])
] for id in patients}
df_['cluster'] = [cluster_list[df_[id_col][i]] for i in range(len(df_))]
df_.to_csv(output_csv)
df_filtered = small_cluster_filter(df_)
if len(df_filtered)==0:
df_filtered = df_.copy()
df_filtered.to_csv(output_csv.replace('eventlog','eventlog_filtered'))
df_xes = pm4py.format_dataframe(df_,
case_id=id_col,
activity_key=ev_col,
timestamp_key=date_col)
event_log = pm4py.convert_to_event_log(df_xes)
pm4py.write_xes(event_log, output_xes)
df_filtered_xes = pm4py.format_dataframe(df_filtered,
case_id=id_col,
activity_key=ev_col,
timestamp_key=date_col)
event_log_filtered = pm4py.convert_to_event_log(df_filtered_xes)
pm4py.write_xes(event_log_filtered,output_xes.replace('eventlog',
'eventlog_filtered'))
Descriptive analysis
The implementation below shows the most frequent traces in each cluster:
Code
def make_data_dict(log,top_k,col_id):
'''Obtain most frequent traces and their statistics
Args:
log (dataframe): event log
top_k (int): number of traces to show
col_id (str): patients id column's name in df_
Returns:
data_dict (dict): traces as keys and their statistics as values
'''
len_id = len(set(log[col_id]))
log_freq = pm4py.stats.get_variants(log)
freq_list = [(t,log_freq[t],len(t)) for t in set(log_freq.keys())]
trace = [list(t[0]) for t in sorted(freq_list, key=lambda x:
(len_id-x[1],x[2]))[:top_k]]
cases = [t[1] for t in sorted(freq_list, key=lambda x:
(len_id-x[1],x[2]))[:top_k]]
top_k = min(top_k,len(cases))
percentage = [100*cases[c]/len_id for c in range(top_k)]
cum_percentage = [sum(percentage[:p+1]) for p in range(top_k)]
data_dict = {"Trace": trace,
"Percentage": percentage,
"Cases": cases,
"Cumulative Percentage": cum_percentage}
return data_dict
def update_color_dict(color_dict,data_dict):
'''update of the color dict to include new events
Args:
color_dict (dict): events as keys and colors as values
data_dict (dict): traces as keys and their statistics as values
Returns:
color_dict (dict): events as keys and colors as values
'''
cmap = plt.cm.get_cmap('tab20')
for event in set(itertools.chain.from_iterable(data_dict['Trace'])):
if event not in color_dict and len(color_dict)==20:
cmap = plt.cm.get_cmap('tab20b')
if event not in color_dict:
try:
color_dict.update({event:cmap(len(color_dict))})
except:
color_dict.update({event:cmap(2*(len(color_dict)-20))})
return color_dict
def trace_plotter(data_dict,color_dict,acronym, output_file, font_size=10,
percentage_box_width=0.8,size=(15,9)):
'''configuration of the trace_explorer plot
Args:
color_dict (dict): events as keys and colors as values
data_dict (dict): traces as keys and their statistics as values
acronym (dict): events as keys and their acronyms as values
output_file (str): figure's path
font_size (int): font size
percentage_box_width (float): event boxes' width
size (tuple): figure's size
'''
fig, ax = plt.subplots(figsize=size)
percentage_position = max(len(t) for t in data_dict["Trace"]
) + percentage_box_width*3 +0.5
for row, (trace, percentage,cases,cum_percentage
) in enumerate(zip(data_dict["Trace"],
data_dict["Percentage"],
data_dict["Cases"],
data_dict["Cumulative Percentage"]),
start=1):
for col, acr in enumerate(trace, start=1):
ax.add_patch(plt.Rectangle((col - 0.5, row - 0.45), 1, 0.9,
facecolor=color_dict[acr],
edgecolor='white'))
ax.text(col,
row,
acr,
ha='center',
va='center',
color='white',
fontsize = font_size,
fontweight='bold')
ax.add_patch(plt.Rectangle((
percentage_position -percentage_box_width*2.5,
row - 0.45), percentage_box_width, 0.9,
facecolor='grey', edgecolor='white'))
ax.text(percentage_position-percentage_box_width*2,
row,
f'{percentage:.2f}%',
ha='center',
va='center',
color='white',
fontsize = font_size+2)
ax.add_patch(plt.Rectangle((
percentage_position - percentage_box_width*1.5,
row - 0.45), percentage_box_width, 0.9,
facecolor='grey', edgecolor='white'))
ax.text(percentage_position-percentage_box_width,
row,
f'{cases}',
ha='center',
va='center',
color='white',
fontsize = font_size+4)
ax.add_patch(plt.Rectangle((percentage_position-percentage_box_width*0.5,
row - 0.45), percentage_box_width, 0.9,
facecolor='grey', edgecolor='white'))
ax.text(percentage_position,
row,
f'{cum_percentage:.2f}%',
ha='center',
va='center',
color='white',
fontsize = font_size+2)
ax.set_xlim(0.5, percentage_position+0.5)
ax.set_xticks(range(1, int(percentage_position-1)))
ax.set_ylabel('Traces',fontsize = font_size+3)
ax.set_ylim(len(data_dict["Trace"]) + 0.45, 0.55) # y-axis is reversed
ax.set_yticks([])
ax.set_xlabel('Activities',fontsize = font_size+3)
handles = [plt.Rectangle((0, 0), 0, 0, facecolor=color_dict[acr],
edgecolor='black', label=acronym[acr])
for acr in acronym if acr in set(
itertools.chain.from_iterable(data_dict['Trace']))]
ax.legend(handles=handles,
bbox_to_anchor=[1.02, 1.02],
loc='upper left',
fontsize = font_size+6)
for dir in ['left', 'right', 'top']:
ax.spines[dir].set_visible(False)
plt.tight_layout()
plt.savefig(output_file)
plt.close()
def trace_explorer(evlog_file='../../outputs/eventlog.xes',top_k=5,
id_col='ID',
ev_col='Event',date_col='date',clust_col='cluster',
color_dict={},cond_code=''):
'''Plot each cluster's most frequent traces
Args:
evlog_file (str): event log's path
top_k (int): number of traces to show per cluster
id_col (str): patients id column's name in evlog_file
ev_col (str): events column's name in evlog_file
date_col (str): events dates column's name in evlog_file
clust_col (str): cluster column's name in evlog_file
color_dict (dict): events as keys and colors as values
cond_code (str): predominant clinical condition's code
'''
log_ = pm4py.read_xes(evlog_file)
log_ = log_.sort_values([id_col,date_col])
log_ = log_[log_['cycle']=='start']
for clust in set(log_[clust_col]):
log = log_[log_[clust_col]==clust]
len_id = len(set(log[id_col]))
acronym = {t:t for t in sorted(set(log[ev_col]))}
data_dict = make_data_dict(log,top_k,id_col)
color_dict = update_color_dict(color_dict, data_dict)
trace_plotter(data_dict,color_dict,acronym,
'../../outputs/t_cluster_%s_%i.png' % (cond_code,clust))
return color_dict
To get the process map of each cluster, the following R functions can be used:
Code
load_csv_log <- function(evlog_csv,case_id="ID",activity_id="Event",
lifecycle_id="cycle", activity_instance_id="actins",
timestamp="date"){
eventlog = read.csv(evlog_csv, header=TRUE, sep = ",")
eventlog = eventlog[order(eventlog$ID),]
#Transform the date into a format we can work with
eventlog$date = as.POSIXct(eventlog$date, tz = "", format="%Y-%m-%d" ,
tryFormats = c("%Y-%m-%d",
"%Y/%m/%d"),
optional = FALSE)
evLog = eventlog %>%
mutate(resource = NA) %>%
mutate(cycle = fct_recode(cycle, "start" = "start","complete" = "end")) %>%
eventlog(
case_id = case_id,
activity_id = activity_id,
lifecycle_id = lifecycle_id,
activity_instance_id = activity_instance_id,
timestamp = timestamp,
resource_id = 'resource'
)
return(evLog)
}
make_process_map <- function(log,t_freq,output_file){
log %>%
filter_activity_frequency(percentage = 1) %>% # keep 100% of activities
filter_trace_frequency(percentage = t_freq) %>%
process_map(type_nodes = performance(mean,units='days'),
type_edges = frequency('relative_case'),
sec_edges = performance(mean,units='days'),
render = T) %>%
export_svg %>%
charToRaw %>%
rsvg_png (output_file,width=2000)
}
process_map_by_cluster <- function(evLog,t_freq,cond_code){
for (clust in unique(evLog$cluster)) {
log <- evLog %>%
filter(cluster == clust)
make_process_map(log,t_freq,gsub("src/analysis-scripts/",
"",here("outputs",sprintf(
"pm_cluster_%s_%d.png",
cond_code, clust) )))
}
}
Conformance checking
Conformance checking is a technique used to check process compliance by comparing the event log of a discovered process with an existing reference model (target model) of the same process. Based on the T2D treatment algorithm shown previously, we used a software tool called Carassius to create the following Petri nets, which serve as treatment guidelines with respect to glycated hemoglobin measurements in patients with frailty.
Fitness is the metric that measures how far a trace deviates from a given process model, or from the guidelines in this case. There are different methods to calculate this metric; the code below uses alignment-based fitness. Since this metric requires fixed initial and final markings, we included the events ‘INI’ and ‘FIN’ in the Petri net and in each trace. Adding these artificial events allows us to compare the fitness of all traces under the same conditions.
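The INI/FIN padding can be illustrated on a single toy trace. The helper and the event names below (MET, MET&SU, HBA<7) are hypothetical and only mimic the pattern used in the pipeline code:

```python
import pandas as pd
from datetime import timedelta

def pad_trace(trace_df, date_col='time:timestamp', ev_col='concept:name'):
    """Wrap one patient's trace with artificial INI/FIN events so every
    trace starts and ends at the same markings of the Petri net."""
    first, last = trace_df.iloc[[0]].copy(), trace_df.iloc[[-1]].copy()
    first[date_col] = first[date_col] - timedelta(days=1)
    last[date_col] = last[date_col] + timedelta(days=1)
    first[ev_col], last[ev_col] = 'INI', 'FIN'
    return pd.concat([first, trace_df, last], ignore_index=True)

# Toy single-patient trace with pm4py-style column names.
trace = pd.DataFrame({
    'case:concept:name': ['p1'] * 3,
    'concept:name': ['MET', 'MET&SU', 'HBA<7'],
    'time:timestamp': pd.to_datetime(['2017-02-01', '2017-06-10', '2017-09-01']),
})
padded = pad_trace(trace)
print(list(padded['concept:name']))  # ['INI', 'MET', 'MET&SU', 'HBA<7', 'FIN']
```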
Code
def id2fitness(log ,net, initial_marking, final_marking, clust_col='cluster',
date_col='date',ev_col='Event'):
'''Obtain traces fitness
Args:
log (dataframe): event log
net: petri net
initial_marking: initial place in the petri net
final_marking: final place in the petri net
clust_col (str): cluster column's name in log
date_col (str): events dates column's name in log
ev_col (str): events column's name in log
Returns:
df (dataframe): traces, their clusters and their fitnesses
'''
id2trace = {id:list(log['Event'][log['case:concept:name']==id]
) for id in set(log['case:concept:name'])}
id2ids = {id:[id2 for id2 in set(id2trace.keys()) if id2trace[id]==id2trace[id2]
] for id in set(log['case:concept:name'])}
for id in set(id2ids.keys()):
try:
for id2 in id2ids[id]:
if id2!=id:
del id2ids[id2]
except:
continue
id2fitness = dict()
for name in set(id2ids.keys()):
log2 = log.drop(log.index[log['case:concept:name'] !=name])
new = log2.copy().iloc[[0, -1]]
date_list = list(new[date_col])
index = list(new['@@index'])
actins = list(new['actins'])
clust = new[clust_col].values[0]
new[date_col] = new['time:timestamp'] = [date_list[0]-timedelta(days=1),
date_list[1]+timedelta(days=1)]
new[ev_col] = new['concept:name'] = ['INI','FIN']
new['@@index'] = [index[0]-1,index[1]+1]
new['actins'] = [actins[0]-1,actins[1]+1]
log2 = pd.concat([log2,new])
aligned_fitness = pm4py.conformance_diagnostics_alignments(
log2, net, initial_marking, final_marking)[0]['fitness']
for id in id2ids[name]:
id2fitness[id] = {'ID':id,
'aligned_fitness':aligned_fitness,
clust_col: clust}
df = pd.DataFrame(id2fitness).T
df.index = range(len(df))
return df
The next function produces a boxplot showing each cluster's fitness distribution.
Code
def conformance(xes_file,pn_file,pn_png_file,ini_place='place100',
fin_place='place111', date_col='date',ev_col='Event',
clust_col='cluster',
output_png='../../outputs/fitness_by_cluster.png',
output_csv='../../outputs/fitness_by_cluster.csv'):
'''Boxplot of the fitness of each cluster
Args:
xes_file (str): event log's path
pn_file (str): petri net's path
pn_png_file (str): path where the petri net figure is saved
ini_place (str): initial place in the petri net
fin_place (str): final place in the petri net
date_col (str): events dates column's name in log
clust_col (str): cluster column's name in log
ev_col (str): events column's name in log
output_png (str): created figure's path
output_csv (str): created csv's path
'''
log = pd.DataFrame(pm4py.read_xes(xes_file))
log = log[log['cycle']=='start']
log = log.sort_values(date_col)
log.index = range(len(log))
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p) for p in net.places].index(
ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p) for p in net.places].index(
fin_place)]] = 1
pm4py.save_vis_petri_net(net, initial_marking, final_marking,
pn_png_file)
df = id2fitness(log,net,initial_marking,final_marking,
clust_col,date_col,ev_col)
df.to_csv(output_csv)
data = [list(df['aligned_fitness'][df[clust_col]==i])
for i in sorted(set(df[clust_col]))]
fig = plt.figure(figsize =(10, 7))
# Creating axes instance
ax = fig.add_axes([0, 0, 1, 1])
# Creating plot
bp = ax.boxplot(data,labels=[i for i in sorted(set(df[clust_col]))])
plt.xlabel("Clusters")
plt.ylabel("Aligned Fitness")
# show plot
plt.savefig(output_png,bbox_inches='tight')
Decision mining
Decision mining allows analyzing event transitions at different parts of a process. In other words, we can measure patients’ characteristics, or their relevance, at a specific place of the given Petri net. The next function builds a decision tree explaining how patients’ characteristics are taken into account. Moreover, it shows a barplot of the relevance of each feature in the decision tree, obtained with mean decrease in impurity (MDI), which calculates each feature’s importance as the sum, over all splits (across all trees) that use the feature, of the impurity decrease weighted by the number of samples the split involves.
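To make the mean-decrease-in-impurity idea concrete, here is a hand-worked sketch of the impurity decrease produced by one binary split, using Gini impurity and invented toy labels:

```python
import numpy as np

def gini(y):
    """Gini impurity of a label vector."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def split_importance(y, mask):
    """Weighted impurity decrease achieved by splitting y with a boolean mask.
    Summing this quantity over every split that uses a feature (across all
    trees) yields that feature's MDI importance."""
    n, n_left = len(y), mask.sum()
    n_right = n - n_left
    return (gini(y)
            - (n_left / n) * gini(y[mask])
            - (n_right / n) * gini(y[~mask]))

y = np.array([0, 0, 0, 1, 1, 1])  # toy outcome, e.g. treatment A vs B
age_over_65 = np.array([True, True, True, False, False, False])
print(split_importance(y, age_over_65))  # 0.5: this split separates the classes perfectly
```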
Code
def decision_tree_pn_place(patients_df,
col_id = 'patient_id',
features_list = ['age','sex','copayment'],
event_log_file="../../outputs/eventlog.xes",
pn_file="./PN.pnml",
place2analyze='place9',
ini_place='place100',
fin_place='place111',
cond_code=''):
'''Decision tree and features importances in selected place of PN
Args:
patients_df (dataframe): patients data wanted to analyze
col_id (str): patients id column's name in df
features_list (list): features wanted to analyze
event_log_file (str): event log's path
pn_file (str): petri net's path
place2analyze (str): place wanted to analyze in petri net
ini_place (str): initial place in the petri net
fin_place (str): final place in the petri net
cond_code (str): predominant clinical condition's code
'''
log = pm4py.read_xes(event_log_file)
log = log[log['cycle']=='start']
log = log[['concept:name','time:timestamp','case:concept:name']]
ini = min(list(log['time:timestamp']))-timedelta(days=1)
fin = max(list(log['time:timestamp']))+timedelta(days=1)
patients = list(set(log['case:concept:name']))
add_ini_fin = {'case:concept:name' : patients*2,
'time:timestamp':[ini]*len(patients)+[fin]*len(patients),
'concept:name':['INI']*len(patients)+['FIN']*len(patients)}
log = pd.concat([log,pd.DataFrame(add_ini_fin)])
log = log.sort_values(['case:concept:name','time:timestamp'])
log.index = range(len(log))
patients_df = patients_df[features_list + [col_id]]
patients_df = patients_df.fillna(2)
patients_df['copayment'] = patients_df['copayment'].values.astype(int).astype(str)
log_patients = log.merge(patients_df,left_on='case:concept:name',
right_on=col_id,how='left')
log_patients = log_patients[['concept:name', 'time:timestamp',
'case:concept:name', 'age',
'sex','copayment']]
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p)
for p in net.places].index(ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p)
for p in net.places].index(fin_place)]] = 1
try:
X, y, class_names = decision_mining.apply(log_patients,
net,
initial_marking,
final_marking,
decision_point=place2analyze)
clf, feature_names, classes = decision_mining.get_decision_tree(log_patients,
net,
initial_marking,
final_marking,
decision_point=place2analyze)
gviz = tree_visualizer.apply(clf, feature_names, classes)
gviz.save(filename='decision_tree_%s' % cond_code,
directory='../../outputs')
visualizer.view(gviz)
importances = clf.feature_importances_
tree_importances = pd.Series(importances, index=feature_names)
fig, ax = plt.subplots()
tree_importances.plot.bar(ax=ax)
ax.set_title("Feature importances using MDI")
ax.set_ylabel("Mean decrease in impurity")
fig.tight_layout()
fig.savefig('../../outputs/barplot_features_importance_%s.png' % cond_code)
except ValueError as err:
if str(err)!='No objects to concatenate':
raise
Time dependent Cox model
The link between guideline adherence, in terms of performed process measures, and clinical outcomes is a question of great interest in diabetes care. A Cox model is a statistical technique for exploring the relationship between patient survival and several explanatory variables. One of the strengths of the Cox model is its ability to handle covariates that change over time, such as treatment adherence. A time-dependent Cox model can be built using each patient trace’s fitness over different time intervals. The function below takes an event log and returns each patient’s fitness per period.
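The counting-process layout produced below (one row per patient-period, with interval bounds t_0 and t_1, the period's fitness as a time-varying covariate, and status flagging the final interval) can be sketched for a single toy patient, assuming 90-day periods:

```python
import pandas as pd

# Toy per-period fitness for one patient followed for 200 days:
# two full 90-day periods plus a final, shorter interval.
fitness_by_period = [0.92, 0.75, 0.60]
follow_up_days, period = 200, 90

rows = []
for n, fit in enumerate(fitness_by_period):
    t0 = n * period
    t1 = min((n + 1) * period, follow_up_days)
    last = (n == len(fitness_by_period) - 1)
    # As in the pipeline, status marks the interval where follow-up ends.
    rows.append({'ID': 'p1', 't_0': t0, 't_1': t1,
                 'fitness': fit, 'status': int(last)})

df = pd.DataFrame(rows)
print(df)
# In R, this layout feeds directly into
# coxph(Surv(t_0, t_1, status) ~ fitness + ..., data = df)
```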
Code
def id2fitness_by_interval(xes_file,pn_file,ini_place='place100',fin_place='place111',
date_n_col='date',id_col='ID',fixed_period_time=90,
output_csv='../../outputs/fitness_by_period.csv'):
'''Compute each patient's aligned fitness by time period
Args:
xes_file (str): event log's path
pn_file (str): petri net's path
ini_place (str): initial place in the petri net
fin_place (str): final place in the petri net
date_n_col (str): events dates column's name in log
id_col (str): patients identifiers column's name in log
fixed_period_time (int): period length in days
output_csv (str): created csv's path
'''
net, initial_marking, final_marking = pm4py.read_pnml(pn_file)
initial_marking = Marking()
initial_marking[list(net.places)[[str(p) for p in net.places].index(
ini_place)]] = 1
final_marking = Marking()
final_marking[list(net.places)[[str(p) for p in net.places].index(
fin_place)]] = 1
event_log = pd.DataFrame(pm4py.read_xes(xes_file))
event_log = event_log[event_log['cycle']=='start']
baseline = datetime.strptime('2017-01-01', "%Y-%m-%d")
endline = datetime.strptime('2023-01-01', "%Y-%m-%d")
dd = [baseline + timedelta(days=x) for x in range((
endline-baseline).days + 1)]
dd_INI = dd[0]-timedelta(days=1)
dd_FIN = dd[-1]+timedelta(days=1)
event_log['time:timestamp'] = event_log['time:timestamp'].dt.tz_localize(None)
event_log[date_n_col] = event_log[date_n_col].dt.tz_localize(None)
event_log[date_n_col] = event_log[date_n_col].apply(
lambda x: dd.index(x))
event_log[date_n_col] = event_log.groupby(id_col)[date_n_col].apply(
lambda x: x-x.min())
period2fitness = pd.DataFrame()
id_list_df = []
p_start = []
p_end = []
fitness = []
date_0 = []
status = []
id_list = list(set(event_log[id_col]))
for id in tqdm(id_list):
event_log_id = event_log[event_log[id_col]==id]
date_max = max(event_log_id[date_n_col])
date_min = min(event_log_id['time:timestamp'])
n_periods = date_max//fixed_period_time
if date_max<=fixed_period_time:
continue
for n in range(1,n_periods+2):
event_log_id_n = event_log_id[
event_log_id[date_n_col]<n*fixed_period_time]
new = event_log_id.copy().iloc[[0, -1]]
actins = list(new['actins'])
index = list(new['@@index'])
new['time:timestamp'] = [dd_INI,dd_FIN]
new['concept:name'] = ['INI','FIN']
new['@@index'] = [index[0]-1,index[1]+1]
new['actins'] = [actins[0]-1,actins[1]+1]
event_log_id_n = pd.concat([event_log_id_n,new])
aligned_fitness = pm4py.conformance_diagnostics_alignments(
event_log_id_n, net,
initial_marking, final_marking)[0]['fitness']
id_list_df.append(id)
p_start.append((n-1)*fixed_period_time)
p_end_value = n*fixed_period_time
if n==n_periods+1:
p_end_value = date_max
p_end.append(p_end_value)
fitness.append(aligned_fitness)
date_0.append(date_min)
if n!=n_periods+1:
status.append(0)
else:
status.append(1)
period2fitness[id_col] = id_list_df
period2fitness['t_0'] = p_start
period2fitness['t_1'] = p_end
period2fitness['fitness'] = fitness
period2fitness['ini_date'] = date_0
period2fitness['status'] = status
period2fitness.to_csv(output_csv,index=False)
Results
Patients without predominant clinical condition
Event log’s creation and description
These results were obtained with a set of synthetic data previously generated according to the data model. Choosing patients without any predominant clinical condition and after some preprocessing, we obtain an event log that is summarized in the process map in Figure 7. It shows how the events are connected, the mean number of days patients spent in each event (or treatment), the percentage of patients who made each transition, and the mean number of days the transition took. However, the result is a spaghetti map from which nothing can be concluded. Therefore, we have to simplify the process map, for example by showing only the most frequent traces covering 20% of the event log, as in Figure 8. Moreover, Figure 9 shows in what percentage of patients’ traces each activity is present.
Clustering traces
Once the set of traces to analyze is selected, a distance measure has to be chosen for clustering. In this example, Jaccard similarity is used to calculate the distance matrix.
Once the distance matrix is obtained, we can cluster, but the number of clusters has to be fixed beforehand. The following figures help us decide what the optimal number of clusters could be.
After choosing the optimal number of clusters, some very small clusters may appear; we exclude those with fewer than 30 traces to avoid clusters of low representation. The process map and the most frequent traces of one of the remaining clusters are shown below.
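As an illustration of the Jaccard measure mentioned above, a minimal Jaccard distance between two traces viewed as sets of events (order and repetitions ignored; the event names are hypothetical):

```python
def jaccard_distance(trace_a, trace_b):
    """One minus the intersection-over-union of the two traces' event sets."""
    a, b = set(trace_a), set(trace_b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

t1 = ['MET', 'MET&SU', 'HBA<7']
t2 = ['MET', 'HBA<7', 'HBA>=7']
print(jaccard_distance(t1, t2))  # 0.5: 2 shared events out of 4 distinct ones
```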


Conformance checking
Once traces are clustered, a boxplot makes it easy to show that each cluster behaves differently with respect to the treatment guidelines. Comparing traces against Figure 6, calculating the fitness of each trace (1 being a perfect match with the Petri net and 0 the lowest possible fitness) and grouping by cluster results in Figure 12.
Decision mining
Figure 13 shows how patients’ age, sex and copayment would influence the prescription of a three-drug treatment. More features could have been added, but for simplicity we included those three.
The relevance of the variables in the previous decision tree is indicated in Figure 14. As we can see, categorical variables are split into as many parts as there are categories, and a variable’s total relevance is the sum of its categories’ values.
Time dependent Cox model
Using fitness as a treatment adherence measure together with some static characteristics, we built a time-dependent Cox model to predict an outcome of interest. In the next example, for simplicity, we chose fitness, age, sex and copayment to try to predict mortality or any change in the predominant clinical condition; its summary is:
Call:
coxph(formula = Surv(t_0, t_1, status) ~ age + sex + fitness,
data = df)
n= 1526, number of events= 211
(3 observations deleted due to missingness)
coef exp(coef) se(coef) z Pr(>|z|)
age -0.01719 0.98295 0.01280 -1.343 0.17915
sex1 0.05078 1.05210 0.13978 0.363 0.71637
fitness -2.63088 0.07202 0.83461 -3.152 0.00162 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
exp(coef) exp(-coef) lower .95 upper .95
age 0.98295 1.0173 0.95860 1.0079
sex1 1.05210 0.9505 0.79997 1.3837
fitness 0.07202 13.8859 0.01403 0.3697
Concordance= 0.578 (se = 0.021 )
Likelihood ratio test= 12.44 on 3 df, p=0.006
Wald test = 11.92 on 3 df, p=0.008
Score (logrank) test = 12.01 on 3 df, p=0.007
Fitness is the only significant covariate: its hazard ratio of 0.07 (95% CI 0.01-0.37) indicates that higher adherence to the treatment guideline is associated with a markedly lower hazard of the event.
Joint Model
Code
## Joint model (lcmm)
### One latent class
#datosHBA$edadcent <- datosHBA$Edaddebut-mean(datosstart$Edaddebut)
library(lcmm)
df$ID2 <- as.numeric(as.factor(df$patient_id))
jointm1 <- Jointlcmm(fixed= fitness ~ t_0 ,random=~1, subject="ID2",
survival = Surv(t_1,status)~ age, hazard="Weibull",
hazardtype="PH",ng=1,data=df)
summary(jointm1)
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, random = ~1, subject = "ID2",
ng = 1, survival = Surv(t_1, status) ~ age, hazard = "Weibull",
hazardtype = "PH", data = df)
Statistical Model:
Dataset: df
Number of subjects: 214
Number of observations: 1529
Number of latent classes: 1
Number of parameters: 7
Event 1:
Number of events: 32
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 25
Convergence criteria: parameters= 6.8e-08
: likelihood= 2.9e-05
: second derivatives= 2.7e-09
Goodness-of-fit statistics:
maximum log-likelihood: 1384.61
AIC: -2755.22
BIC: -2731.66
Score test statistic for CI assumption: 0.019 (p-value=0.8895)
Maximum Likelihood Estimates:
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.05391 0.04021 1.341 0.18003
event1 +/-sqrt(Weibull2) 1.19828 0.07593 15.782 0.00000
age -0.03140 0.03322 -0.945 0.34442
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept 0.51642 0.00552 93.602 0.00000
t_0 -0.00028 0.00001 -44.675 0.00000
Variance-covariance matrix of the random-effects:
intercept
intercept 0.00482
coef Se
Residual standard error 0.07153 0.00139
Code
##### 3 latent classes #####
## Linear
jointm2 <- Jointlcmm(fixed= fitness ~ t_0 ,random=~1,mixture=~t_0, subject="ID2",
survival = Surv(t_1,status)~ age, hazard="Weibull",
hazardtype="PH",ng=3,data=df,
B=jointm1)
summary(jointm2)
Joint latent class model for quantitative outcome and competing risks
fitted by maximum likelihood method
Jointlcmm(fixed = fitness ~ t_0, mixture = ~t_0, random = ~1,
subject = "ID2", ng = 3, survival = Surv(t_1, status) ~ age,
hazard = "Weibull", hazardtype = "PH", data = df)
Statistical Model:
Dataset: df
Number of subjects: 214
Number of observations: 1529
Number of latent classes: 3
Number of parameters: 15
Event 1:
Number of events: 32
Proportional hazards over latent classes and
Weibull baseline risk function
Iteration process:
Convergence criteria satisfied
Number of iterations: 60
Convergence criteria: parameters= 1.1e-06
: likelihood= 3e-06
: second derivatives= 3.3e-07
Goodness-of-fit statistics:
maximum log-likelihood: 1538.19
AIC: -3046.38
BIC: -2995.89
Score test statistic for CI assumption: 5.988 (p-value=0.0144)
Maximum Likelihood Estimates:
Fixed effects in the class-membership model:
(the class of reference is the last class)
coef Se Wald p-value
intercept class1 -0.30185 0.21432 -1.408 0.15902
intercept class2 -0.56678 0.26816 -2.114 0.03455
Parameters in the proportional hazard model:
coef Se Wald p-value
event1 +/-sqrt(Weibull1) 0.04048 0.02141 1.891 0.05865
event1 +/-sqrt(Weibull2) 1.52285 0.09562 15.926 0.00000
event1 SurvPH class1 -4.14374 0.95392 -4.344 0.00001
event1 SurvPH class2 -2.49583 0.61407 -4.064 0.00005
age 0.01133 0.03849 0.294 0.76850
Fixed effects in the longitudinal model:
coef Se Wald p-value
intercept class1 0.45948 0.01124 40.887 0.00000
intercept class2 0.60072 0.01744 34.440 0.00000
intercept class3 0.56981 0.00957 59.511 0.00000
t_0 class1 -0.00021 0.00001 -28.359 0.00000
t_0 class2 -0.00038 0.00002 -18.689 0.00000
t_0 class3 -0.00068 0.00003 -23.168 0.00000
Variance-covariance matrix of the random-effects:
intercept
intercept 0.00403
coef Se
Residual standard error 0.05946 0.00122
Code
postprob(jointm2)
Posterior classification based on longitudinal and time-to-event data:
class1 class2 class3
N 68.00 43.00 103.00
% 31.78 20.09 48.13
Posterior classification table:
--> mean of posterior probabilities in each class
prob1 prob2 prob3
class1 0.8506 0.1018 0.0476
class2 0.0805 0.8195 0.1000
class3 0.0709 0.1017 0.8275
Posterior probabilities above a threshold (%):
class1 class2 class3
prob>0.7 76.47 76.74 77.67
prob>0.8 72.06 53.49 61.17
prob>0.9 55.88 39.53 48.54
Posterior classification based only on longitudinal data:
class1 class2 class3
N 72.00 43.00 99.00
% 33.64 20.09 46.26
Code
d1 <- data.frame(t_0=seq(1,1000,length.out=100))
data.new <-data.frame(d1,age=0)
jointm2.pred <- predictY(jointm2, data.new , var.time = "t_0")
plot(jointm2.pred, bty = "l", ylim = c(0, 1), legend.loc = FALSE,
ylab = "fitness", xlab = "days", lwd = 2,col=c("red","navy","green"))
Code
plot(jointm2, which = "survival", lwd = 2, legend.loc = FALSE, bty = "l",
xlab = "days", ylab = "event-free probability",col=c("red","navy","green"))
Code
table(jointm2$pprob$class,jointm2$pprobY$class)
1 2 3
1 66 1 1
2 1 41 1
3 5 1 97
CV Disease
Event log’s creation and description
Selecting the patients with a cardiovascular disease and applying some preprocessing, we obtain a new event log on which we repeat the analysis carried out for the patients without any predominant clinical condition.
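A sketch of how such a subcohort event log might be derived, assuming a pandas event log with `patient_id`, `activity` and `timestamp` columns and a set of patient IDs flagged with the condition; all names are hypothetical placeholders for the real extraction code:

```python
import pandas as pd

def subcohort_log(event_log: pd.DataFrame, condition_patients: set) -> pd.DataFrame:
    """Keep only the traces of patients in the given condition cohort,
    preserving event order within each trace."""
    mask = event_log["patient_id"].isin(condition_patients)
    return (event_log.loc[mask]
            .sort_values(["patient_id", "timestamp"])
            .reset_index(drop=True))
```

The same helper would apply unchanged to the heart failure, chronic kidney disease and frailty subcohorts below, only the set of flagged patients changes.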
Clustering traces
Conformance checking
Decision mining
Heart Failure
Event log’s creation and description
Selecting the patients with heart failure and applying some preprocessing, we obtain a new event log on which we repeat the analysis carried out for the patients without any predominant clinical condition.
Clustering traces
Conformance checking
Decision mining


Chronic kidney disease
Event log’s creation and description
Selecting the patients with chronic kidney disease and applying some preprocessing, we obtain a new event log on which we repeat the analysis carried out for the patients without any predominant clinical condition.
Clustering traces
Conformance checking
Decision mining


Frailty
Event log’s creation and description
Selecting the patients with frailty and applying some preprocessing, we obtain a new event log on which we repeat the analysis carried out for the patients without any predominant clinical condition.